I first read in the raw data:
countData<-read.xlsx("~/Documents/VIB/Projects/Integrative_Paris/National_cohort/RNAseq/Global Table Count Cryostem 01.2019.xlsx")
colnames(countData) <- as.character(countData[1,]) # set patient names as colnames
rownames(countData) <- countData$ID # set gene names as rownames
countData_withinfo <- countData[,-1]
I can then look at the extra information on the patients’ unmapped genes, lib size, …
## [1] "D1710"
## [1] 60433 136
I then load in the clinical data, and reorder the RNAseq and clinical data tables so that they are in the same order:
I then identify the donors and recipients, for which we have complete cases:
## [1] "We have 62 complete couples in the RNA seq data of the Cryostem cohort."
Finally, I generate the age and gender vectors that will be used for modeling in all patients (donors, recipients and couples):
## [1] "M"
## [1] "F"
## [1] 19529
## COUPLENUMBER Gender DOG DOB GROUP RIN
## D4147 1 M 2015-07-23 1960-03-31 Non_Tolerant 8.3
## R4147 1 F 2015-07-23 1962-02-02 Non_Tolerant 7.6
## D3574 10 F 2015-04-10 1950-09-17 Primary_tolerant 6.5
## R3574 10 F 2015-04-10 1968-11-26 Primary_tolerant 7.8
## D557 11 F 2013-05-29 1982-01-16 Primary_tolerant 7.9
## R557 11 F 2013-05-29 1980-07-12 Primary_tolerant 7.3
## CMVStatus gender_comp age_recip
## D4147 1 MF 19529
## R4147 1 MF 19529
## D3574 1 FF 16936
## R3574 1 FF 16936
## D557 0 FF 12009
## R557 0 FF 12009
I will first focus on the recipients only, and perform a standard preprocessing (keep genes with more than one count per million in at least three patients):
The names of the genes are in the ensembl ID format. When possible, I replace them with the associated gene symbol:
Perform normalisation and initiate the design matrix (per group)
Save the produced objects:
Load the produced data :
load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNA_seq_national/recip_only/expression_table_recip.RData")
load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNA_seq_national/recip_only/data_for_DE_recip.RData")
Fit a model to the design:
## Contrasts
## Levels group1 group2 group3
## non_tolerant 1 1 0
## primary_tolerant -1 0 1
## secondary_tolerant 0 -1 -1
After adjusting the pvalues for multiple testing, we can see how many genes were identified as differentially expressed between the different tolerance groups:
#### Adjust P-values via Benjamini-Hochberg
##### 1. non-tol vs tol-1
allGenesGroup1<-topTable(fit.eb, adjust="BH", sort.by="P",number=Inf, coef=1)
DEgenesGroup1<-getDEgenes(allGenesGroup1,0.05,1)
paste0("Non tolerant vs primary: ", dim(DEgenesGroup1)[1])
## [1] "Non tolerant vs primary: 0"
##84
##### 2. non-tol vs tol-2
allGenesGroup2<-topTable(fit.eb, adjust="BH", sort.by="P",number=Inf, coef=2)
DEgenesGroup2<-getDEgenes(allGenesGroup2,0.05,1)
paste0("Non tolerant vs secondary: ", dim(DEgenesGroup2)[1])
## [1] "Non tolerant vs secondary: 0"
##57
##### 3. tol-1 vs tol-2
allGenesGroup3<-topTable(fit.eb, adjust="BH", sort.by="P",number=Inf, coef=3)
DEgenesGroup3<-getDEgenes(allGenesGroup3,0.05,1)
paste0("Primary vs secondary: ", dim(DEgenesGroup3)[1])
## [1] "Primary vs secondary: 0"
##0
No genes were identified as differentially expressed between the different groups, in the Cryostem cohort.
filter out lowly variable genes:
Prepare the data:
We perform a PCA on the recipients to see if we already observe some structure in the data, and if there isn’t too much batch effect: We don’t observe much structure, relatively to the tolerance groups, or batches.
We can try to build a random forest model using all the genes of the recipients:
## predicted
## true non_tolerant primary_tolerant secondary_tolerant
## non_tolerant 6 12 1
## primary_tolerant 7 20 2
## secondary_tolerant 1 10 3
## [1] 53.22581
But it seems like the data doesn’t allow us to classify into the three tolerance groups. We can try to simplify the classification problem:
First, we try to separate tolerant from non tolerant recipients:
set.seed(1)
rf <- rf_tol_non_tol(df_orig = norm_data, ntree = 5000)
rf
##
## Call:
## randomForest(formula = group ~ ., data = df_tmp, mtry = mtry, ntree = ntree, importance = T, proximity = T)
## Type of random forest: classification
## Number of trees: 5000
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 32.26%
## Confusion matrix:
## non_tolerant tolerant class.error
## non_tolerant 3 16 0.84210526
## tolerant 4 39 0.09302326
Second, we can try to separate primary tolerant from secondary tolerant recipients:
set.seed(1)
rf <- rf_tol1_tol2(df_orig = norm_data, ntree = 5000)
rf
##
## Call:
## randomForest(formula = group ~ ., data = df_tol, mtry = mtry, ntree = ntree, importance = T, proximity = T)
## Type of random forest: classification
## Number of trees: 5000
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 37.21%
## Confusion matrix:
## primary_tolerant secondary_tolerant class.error
## primary_tolerant 25 4 0.1379310
## secondary_tolerant 12 2 0.8571429
But it seems like we can’t classify between these two groups of patients yet.
Run the permutation distribution, taking demographic variables into account: I first add the age and gender compatibility information to the data.
And then compute the permutation distribution for the non vs tolerance:
Compute the median per group for every gene, to use it later in the graphs:
load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNA_seq_national/recip_only/perm_vals_TNT.RData")
load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNA_seq_national/recip_only/norm_data_TNT.RData")
gr_medians <- norm_data %>%
group_by(group) %>%
summarize_if(is.numeric, median) %>%
as.data.frame()
rownames(gr_medians) <- gr_medians$group
These genes were selected as informative to classify the non and tolerant recipients, with a 0.95 threshold:
## Note: zip::zip() is deprecated, please use zip::zipr() instead
## [1] "1264 genes were selected with a 0.95 threshold in the Cryostem cohort."
Which of these selected genes are common with the Cryostem cohort?
## [1] "140 genes were common between the two cohorts."
Graph of the 140 selected genes that were found commonly between the two cohorts:
I save these features:
We can see how many of these genes had the same behaviour in the two cohorts (ie, were over or underexpressed in the same group of patients)
## [1] "111 genes out of the 140 common selected genes have the same behaviour in both cohorts."
Graph of the 111 genes that were found commonly in the two cohorts AND had the same behaviour in the two cohorts:
selb <- as.character(sel_beh$sel_beh)
correlations <- norm_data[,selb] %>%
as.matrix %>%
cor %>%
as.data.frame %>%
rownames_to_column(var = 'var1') %>%
gather(var2, value, -var1) %>%
dplyr::filter(var1 < var2)
correlated_features <- correlations %>%
dplyr::filter(value > 0.7)
gr <- graph_from_data_frame(correlated_features, directed = FALSE, vertices=selb)
nodes <- get.vertex.attribute(gr)
color_nodes <- rep("#FF000080", length(nodes$name))
medians <- gr_medians[,nodes$name]
for (gene in seq_along(colnames(medians))){
if (medians["tolerant", gene] == max(medians[, gene])){
color_nodes[gene] <- "#0000FF80"
}
}
# recip_path <- "~/Desktop/"
#
# pdf(paste0(recip_path, "cryo_common_metabolites_recip_05.pdf"), width = 12, height = 8)
set.seed(1)
plot(gr, vertex.size = 7, vertex.label.cex = .5, edge.width = (E(gr)$value - .7) * 20,
vertex.color = color_nodes, layout = layout_nicely)
# dev.off()
I save the 111 features in an excel file: The excel file contains information on which features were correlated in the graph above. It also contains, for every gene, the associated logFC, pvalue…, where the comparison is in non tolerant versus tolerant patients (eg, a positive Fold Change for a gene therefore means that the gene is overexpressed in non tolerant patients)
Now that we have selected features that should help us classify non and tolerant recipients, we can see that a RF model built on these selected features gives slightly better results than the one built on all features:
set.seed(1)
rf <- ranger::ranger(group~., data = rna_recip_cryo_TNT_95, num.trees = 5000, importance = "impurity")
rf$confusion.matrix
## predicted
## true non_tolerant tolerant
## non_tolerant 7 12
## tolerant 8 35
paste0("prediction error: ", rf$prediction.error*100)
## [1] "prediction error: 32.258064516129"
We can also try to use only the features that were common and behaved similarly in the two cohorts, we can also build a classifier on these 111 features. But the RF model built on the 278 selected features isn’t better than the one built on the 140 common genes:
##
## Call:
## randomForest(formula = group ~ ., data = df_tmp, mtry = mtry, ntree = ntree, importance = T, proximity = T)
## Type of random forest: classification
## Number of trees: 5000
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 35.48%
## Confusion matrix:
## non_tolerant tolerant class.error
## non_tolerant 5 14 0.7368421
## tolerant 8 35 0.1860465
These genes were selected as informative to classify the non and tolerant recipients, with a 0.90 threshold:
## [1] "2492 genes were above the 0.90 threshold in the Cryostem cohort."
Which of these selected genes are common with the Cryostem cohort?
sel_louis <- read.xlsx("../data/RNAseq/recip_only/perm_val_TNT_90_thresh.xlsx", colNames = FALSE)
sel_common <- which(selected_genes_qt %in% sel_louis$X1)
print(paste0(length(sel_common), " genes were common between the two cohorts."))
## [1] "408 genes were common between the two cohorts."
# selected_genes_qt[sel_common]
Graph of these selected genes that were found commonly between the two cohorts:
I save the 408 features in an excel file: The excel file contains information on which features were correlated in the graph above. It also contains, for every gene, the associated logFC, pvalue…, where the comparison is in non tolerant versus tolerant patients (eg, a positive Fold Change for a gene therefore means that the gene is overexpressed in non tolrant patients)
We can see how many of these genes had the same behaviour in the two cohorts (ie, were over or underexpressed in the same group of patients)
## [1] "278 genes out of the 408 common selected genes have the same behaviour in both cohorts."
Graph of the 278 genes that were found commonly in the two cohorts AND had the same behaviour in the two cohorts:
selb <- as.character(sel_beh$sel_beh)
correlations <- norm_data[,selb] %>%
as.matrix %>%
cor %>%
as.data.frame %>%
rownames_to_column(var = 'var1') %>%
gather(var2, value, -var1) %>%
dplyr::filter(var1 < var2)
correlated_features <- correlations %>%
dplyr::filter(value > 0.7)
gr <- graph_from_data_frame(correlated_features, directed = FALSE, vertices=selb)
nodes <- get.vertex.attribute(gr)
color_nodes <- rep("#FF000080", length(nodes$name))
medians <- gr_medians[,nodes$name]
for (gene in seq_along(colnames(medians))){
if (medians["tolerant", gene] == max(medians[, gene])){
color_nodes[gene] <- "#0000FF80"
}
}
# recip_path <- "~/Desktop/"
#
# pdf(paste0(recip_path, "cryo_common_metabolites_recip_05.pdf"), width = 12, height = 8)
set.seed(1)
plot(gr, vertex.size = 7, vertex.label.cex = .5, edge.width = (E(gr)$value - .7) * 20,
vertex.color = color_nodes, layout = layout_nicely)
# dev.off()
I save the 278 features in an excel file: The excel file contains information on which features were correlated in the graph above. It also contains, for every gene, the associated logFC, pvalue…, where the comparison is in non tolerant versus tolerant patients (eg, a positive Fold Change for a gene therefore means that the gene is overexpressed in non tolrant patients)
Now that we have selected features that should help us classify tolerant and non tolerant recipients, we can see that a RF model built on these selected features is slightly better than the one built on all genes:
set.seed(1)
rf <- rf_tol_non_tol(df_orig = rna_recip_cryo_TNT_90, ntree = 5000)
rf
##
## Call:
## randomForest(formula = group ~ ., data = df_tmp, mtry = mtry, ntree = ntree, importance = T, proximity = T)
## Type of random forest: classification
## Number of trees: 5000
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 30.65%
## Confusion matrix:
## non_tolerant tolerant class.error
## non_tolerant 6 13 0.6842105
## tolerant 6 37 0.1395349
We can also try to use only the features that were common and behaved similarly in the two cohorts, we can also build a classifier on these 278 features. But the RF model built on the 278 selected features isn’t better than the one built on the 408 common genes:
##
## Call:
## randomForest(formula = group ~ ., data = df_tmp, mtry = mtry, ntree = ntree, importance = T, proximity = T)
## Type of random forest: classification
## Number of trees: 5000
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 33.87%
## Confusion matrix:
## non_tolerant tolerant class.error
## non_tolerant 6 13 0.6842105
## tolerant 8 35 0.1860465
Of the 408 genes that were selected as informative on tolerance in both cohorts, which ones are still high after correcting for age and gender?
Most of the selected genes still seem to add information on the tolerance even after correction for gender and age:
names(qt_sel)[which(qt_sel>0.9)]
## [1] "ENSG00000215014" "TTC34" "SDC3"
## [4] "FCGR1CP" "IL10" "C1orf35"
## [7] "ENSG00000272735" "PAIP2B" "DYSF"
## [10] "LOXL3" "TNS1" "CTDSPL"
## [13] "LINC01215" "LINC02021" "MYL5"
## [16] "PARM1" "LOC100507053" "MARCH3"
## [19] "ABLIM3" "PDGFRB" "GPRIN1"
## [22] "FLT4" "HIST1H4A" "ENSG00000218713"
## [25] "ENSG00000222078" "GPER1" "MBLAC1"
## [28] "NRCAM" "FAM131B" "ATP6V0E2"
## [31] "ASIC3" "SLC25A6" "ARHGAP6"
## [34] "MAOB" "FBP1" "LCN2"
## [37] "SAPCD2" "LOC171391" "TOLLIP.AS1"
## [40] "RIC3" "SPON1" "PC"
## [43] "STARD10" "ZEB1" "PSD"
## [46] "FBXL15" "SSPN" "ENSG00000260329"
## [49] "SLC22A17" "CMTM5" "TRMT61A"
## [52] "ENSG00000259845" "ENSG00000277142" "CACNA1H"
## [55] "EME2" "SNHG9" "DNAJA3"
## [58] "ARMC5" "TGFB1I1" "ENSG00000278909"
## [61] "FBXL8" "PARD6A" "APRT"
## [64] "SH3GL1P1" "PLXDC1" "HSF5"
## [67] "ENGASE" "ENSG00000262580" "NPTX1"
## [70] "RFNG" "HAR1A" "C20orf204"
## [73] "LOC105372233" "IZUMO4" "ICAM3"
## [76] "QTRT1" "TMEM205" "JUNB"
## [79] "ENSG00000267670" "SLC27A1" "LOC102723540"
## [82] "GRAMD1A" "CIC" "ADM5"
## [85] "SIGLEC11" "RPS9" "GP6"
## [88] "ZNF628" "C19orf18" "SLC27A5"
## [91] "TXNRD2" "PI4KAP2" "TCN2"
## [94] "ENSG00000226450" "ENSG00000273253" "SELENOO"
## [97] "MAPK11" "ENSG00000184441" "COL6A1"
## [100] "ENSG00000211459"
order_demo_corr <- sel_com_names[order(qt_sel, decreasing=T)]
Which of the 408 selected genes were most linked to gender?
Boxplots of these most gender related genes:
Which of the 408 selected genes were most linked to age?
We can plot the genes that seemed to be most linked to the age of the recipients:
Run the permutation distribution, taking demographic variables into account: I first add the age and gender compatibility information to the data.
And then compute the permutation distribution for the non vs tolerance:
Compute the median per group for every gene, to use it later in the graphs:
load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNA_seq_national/recip_only/perm_vals_PS.RData")
load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNA_seq_national/recip_only/norm_data_PS.RData")
gr_medians <- norm_data %>%
group_by(group) %>%
summarize_if(is.numeric, median) %>%
as.data.frame()
rownames(gr_medians) <- gr_medians$group
These genes were selected as informative to classify the non and tolerant recipients, with a 0.95 threshold:
## [1] "367 genes were above the 0.95 threshold in the Cryostem cohort."
Which of these selected genes are common with the Cryostem cohort?
## [1] "3 genes were common between the two cohorts: "
## [1] "ENSG00000237818" "KLRC2" "ENSG00000251876"
I save these features:
Boxplots of these common metabolites:
We can see how many of these genes had the same behaviour in the two cohorts (ie, were over or underexpressed in the same group of patients)
## [1] "1 genes out of the 3 common selected genes has the same behaviour in both cohorts."
I save the 1 feature in an excel file: The excel file contains, for every gene, the associated logFC, pvalue…, where the comparison is in primary versus secondary tolerant patients (eg, a positive Fold Change for a gene therefore means that the gene is overexpressed in non tolerant patients)
Now that we have selected features that should help us classify non and tolerant recipients, we can see that a RF model built on these selected features gives slightly better results than the one built on all features:
set.seed(1)
rf <- ranger::ranger(group~., data = rna_recip_cryo_PS_95, num.trees = 5000, importance = "impurity")
rf$confusion.matrix
## predicted
## true primary_tolerant secondary_tolerant
## primary_tolerant 21 8
## secondary_tolerant 11 3
paste0("prediction error: ", rf$prediction.error*100)
## [1] "prediction error: 44.1860465116279"
These genes were selected as informative to classify the non and tolerant recipients, with a 0.90 threshold:
## [1] "907 genes were above the 0.95 threshold in the Cryostem cohort."
Which of these selected genes are common with the Cryostem cohort?
## [1] "45 genes were common between the two cohorts."
Graph of these selected genes that were found commonly between the two cohorts:
I save the 45 features in an excel file: The excel file contains information on which features were correlated in the graph above. It also contains, for every gene, the associated logFC, pvalue…, where the comparison is in non tolerant versus tolerant patients (eg, a positive Fold Change for a gene therefore means that the gene is overexpressed in non tolrant patients)
We can see how many of these genes had the same behaviour in the two cohorts (ie, were over or underexpressed in the same group of patients)
## [1] "24 genes out of the 45 common selected genes have the same behaviour in both cohorts."
Graph of the 24 genes that were found commonly in the two cohorts AND had the same behaviour in the two cohorts:
selb <- as.character(sel_beh$sel_beh)
correlations <- norm_data[,selb] %>%
as.matrix %>%
cor %>%
as.data.frame %>%
rownames_to_column(var = 'var1') %>%
gather(var2, value, -var1) %>%
dplyr::filter(var1 < var2)
correlated_features <- correlations %>%
dplyr::filter(value > 0.7)
gr <- graph_from_data_frame(correlated_features, directed = FALSE, vertices=selb)
nodes <- get.vertex.attribute(gr)
color_nodes <- rep("#0000FF80", length(nodes$name))
medians <- gr_medians[,nodes$name]
for (gene in seq_along(colnames(medians))){
if (medians["primary_tolerant", gene] == max(medians[, gene])){
color_nodes[gene] <- "#00FF0080"
}
}
# recip_path <- "~/Desktop/"
#
# pdf(paste0(recip_path, "cryo_common_metabolites_recip_05.pdf"), width = 12, height = 8)
set.seed(1)
plot(gr, vertex.size = 7, vertex.label.cex = .5, edge.width = (E(gr)$value - .7) * 20,
vertex.color = color_nodes, layout = layout_nicely)
# dev.off()
I save the 24 features in an excel file: The excel file contains information on which features were correlated in the graph above. It also contains, for every gene, the associated logFC, pvalue…, where the comparison is in non tolerant versus tolerant patients (eg, a positive Fold Change for a gene therefore means that the gene is overexpressed in non tolrant patients)
Now that we have selected 45 features that should help us classify primary and secondary tolerant recipients, we can see that a RF model built on these selected features is slightly better than the one built on all genes:
set.seed(1)
rf <- ranger::ranger(group~., data = rna_recip_cryo_PS_90, num.trees = 5000, importance = "impurity")
rf$confusion.matrix
## predicted
## true primary_tolerant secondary_tolerant
## primary_tolerant 28 1
## secondary_tolerant 10 4
paste0("prediction error: ", rf$prediction.error*100)
## [1] "prediction error: 25.5813953488372"
We can also try to use only the 24 features that were common and behaved similarly in the two cohorts. But the RF model built on the 24 selected features isn’t better than the one built on the 44 common genes:
## predicted
## true primary_tolerant secondary_tolerant
## primary_tolerant 27 2
## secondary_tolerant 9 5
## [1] "prediction error: 25.5813953488372"
Of the 45 genes that were selected as informative on tolerance in both cohorts, which ones are still high after correcting for age and gender?
Some of the selected genes still seem to add information on the tolerance even after correction for gender and age:
names(qt_sel)[which(qt_sel>0.8)]
## [1] "CA2" "ENSG00000256658" "TOP2A" "MAPT"
order_demo_corr <- sel_com_names[order(qt_sel, decreasing=T)]
Which of the 45 selected genes were most linked to gender?
Boxplot of the most gender related gene:
Which of the 45 selected genes were most linked to age?
## [1] 1
We can plot the genes that seemed to be most linked to the age of the recipients:
The names of the genes are in the ensembl ID format. When possible, I replace them with the associated gene symbol:
Perform normalisation and initiate the design matrix (per group)
Save the produced objects:
Load the produced data :
load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNA_seq_national/donors_only/expression_table_donors.RData")
load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNA_seq_national/donors_only/data_for_DE_donors.RData")
Fit a model to the design:
## Contrasts
## Levels group1 group2 group3
## non_tolerant 1 1 0
## primary_tolerant -1 0 1
## secondary_tolerant 0 -1 -1
Adjust the pvalues for multiple testing : Number of DE genes found for the three groups
## [1] "Non tolerant vs primary: 0"
## [1] "Non tolerant vs secondary: 0"
## [1] "Primary vs secondary: 0"
filter out lowly variable genes:
Prepare the data:
We perform a PCA on the donors to see if we already observe some structure in the data, and if there isn’t too much batch effect: We don’t observe much structure on the PCA.
We can try to build a random forest model using all the genes of the donors:
## predicted
## true non_tolerant primary_tolerant secondary_tolerant
## non_tolerant 6 13 0
## primary_tolerant 5 23 1
## secondary_tolerant 3 11 0
## [1] 53.22581
But it seems like the data doesn’t allow us to classify into the three tolerance groups. We can try to simplify the classification problem:
First, we try to separate tolerant from non tolerant donors:
set.seed(1)
rf <- rf_tol_non_tol(df_orig = norm_data, ntree = 5000)
rf
##
## Call:
## randomForest(formula = group ~ ., data = df_tmp, mtry = mtry, ntree = ntree, importance = T, proximity = T)
## Type of random forest: classification
## Number of trees: 5000
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 32.26%
## Confusion matrix:
## non_tolerant tolerant class.error
## non_tolerant 2 17 0.89473684
## tolerant 3 40 0.06976744
Second, we can try to classify primary tolerant from secondary tolerant donors:
set.seed(1)
rf <- rf_tol1_tol2(df_orig = norm_data, ntree = 5000)
rf
##
## Call:
## randomForest(formula = group ~ ., data = df_tol, mtry = mtry, ntree = ntree, importance = T, proximity = T)
## Type of random forest: classification
## Number of trees: 5000
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 34.88%
## Confusion matrix:
## primary_tolerant secondary_tolerant class.error
## primary_tolerant 28 1 0.03448276
## secondary_tolerant 14 0 1.00000000
But the error rates are still very high, even after splitting the classification problem in two.
Run the permutation distribution, taking demographic variables into account: I first add the age and gender compatibility information to the data.
And then compute the permutation distribution for the non vs tolerance:
Compute the median per group for every gene, to use it later in the graphs:
load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNA_seq_national/donors_only/perm_vals_TNT.RData")
load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNA_seq_national/donors_only/norm_data_TNT.RData")
gr_medians <- norm_data %>%
group_by(group) %>%
summarize_if(is.numeric, median) %>%
as.data.frame()
rownames(gr_medians) <- gr_medians$group
These genes were selected as informative to classify the non and tolerant recipients, with a 0.95 threshold:
## [1] "675 genes were above the 0.95 threshold in the Cryostem cohort."
Which of these selected genes are common with the Cryostem cohort?
## [1] "70 genes were common between the two cohorts."
Graph of these selected genes that were found commonly between the two cohorts:
I save these features:
We can see how many of these genes had the same behaviour in the two cohorts (ie, were over or underexpressed in the same group of patients)
## [1] "5 genes out of the 70 common selected genes have the same behaviour in both cohorts."
Boxplots of thes 5 common genes:
I save the 5 features in an excel file: The excel file contains information on which features were correlated in the graph above. It also contains, for every gene, the associated logFC, pvalue…, where the comparison is in non tolerant versus tolerant patients (eg, a positive Fold Change for a gene therefore means that the gene is overexpressed in non tolerant patients)
Now that we have selected features that should help us classify non and tolerant donorients, we can see that a RF model built on these selected features gives slightly better results than the one built on all features:
set.seed(1)
rf <- ranger::ranger(group~., data = rna_donor_cryo_TNT_95, num.trees = 5000, importance = "impurity")
rf$confusion.matrix
## predicted
## true non_tolerant tolerant
## non_tolerant 5 14
## tolerant 4 39
paste0("prediction error: ", rf$prediction.error*100)
## [1] "prediction error: 29.0322580645161"
We can also try to use only the features that were common and behaved similarly in the two cohorts, we can also build a classifier on these 111 features. But the RF model built on the 5 selected features isn’t better than the one built on the 70 common genes:
##
## Call:
## randomForest(formula = group ~ ., data = df_tmp, mtry = mtry, ntree = ntree, importance = T, proximity = T)
## Type of random forest: classification
## Number of trees: 5000
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 32.26%
## Confusion matrix:
## non_tolerant tolerant class.error
## non_tolerant 8 11 0.5789474
## tolerant 9 34 0.2093023
These genes were selected as informative to classify the non and tolerant donorients, with a 0.90 threshold:
## [1] "1258 genes were above the 0.90 threshold in the Cryostem cohort."
Which of these selected genes are common with the Cryostem cohort?
## [1] "310 genes were common between the two cohorts."
Graph of these selected genes that were found commonly between the two cohorts:
I save these features:
We can see how many of these genes had the same behaviour in the two cohorts (ie, were over or underexpressed in the same group of patients)
## [1] "22 genes out of the 310 common selected genes have the same behaviour in both cohorts."
Graph of these 22 selected genes that were found commonly between the two cohorts:
I save these features:
Now that we have selected features that should help us classify primary and secondary tolerant recipients, we can see that a RF model built on these selected features is better than the one built on all genes:
## predicted
## true non_tolerant tolerant
## non_tolerant 5 14
## tolerant 3 40
## [1] "prediction error: 27.4193548387097"
We can do the same on the 22 genes that had the same behaviour in both cohorts, but it doesn’t seem to improve:
## predicted
## true non_tolerant tolerant
## non_tolerant 7 12
## tolerant 5 38
## [1] "prediction error: 27.4193548387097"
Of the 310 genes that were selected as informative on tolerance in both cohorts, which ones are still high after correcting for age and gender?
A few genes (tail of the second “bump” in the density plot) still add information on the tolerance even after correction for gender and age:
names(qt_sel)[which(qt_sel>0.9)]
## [1] "ENSG00000189195" "PSD4" "ENSG00000269982"
## [4] "ENSG00000250574" "NIPAL4" "ENSG00000231769"
## [7] "FAM200A" "NRCAM" "AKNA"
## [10] "ENSG00000255097" "CD69" "SMUG1"
## [13] "ENSG00000258875" "MNS1" "ITGAL"
## [16] "ENSG00000279294" "KEAP1" "ETFB"
## [19] "ENSG00000268520" "ADORA2A.AS1"
order_demo_corr <- sel_com_names[order(qt_sel, decreasing=T)]
Which of the 310 selected genes were most linked to gender?
## [1] "ENSG00000231769"
Boxplots of these most gender related genes:
Which of the 310 selected genes were most linked to age?
We can plot the genes that seemed to be most linked to the age of the recipients:
Run the permutation distribution, taking demographic variables into account: I first add the age and gender compatibility information to the data.
And then compute the permutation distribution for the non vs tolerance:
Compute the median per group for every gene, to use it later in the graphs:
load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNA_seq_national/donors_only/perm_vals_PS.RData")
load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNA_seq_national/donors_only/norm_data_PS.RData")
gr_medians <- norm_data %>%
group_by(group) %>%
summarize_if(is.numeric, median) %>%
as.data.frame()
rownames(gr_medians) <- gr_medians$group
These genes were selected as informative to classify the non and tolerant recipients, with a 0.95 threshold:
## [1] "441 genes were above the 0.90 threshold in the Cryostem cohort."
Which of these selected genes are common with the Cryostem cohort?
## [1] "26 genes were common between the two cohorts."
Graph of these selected genes that were found commonly between the two cohorts:
I save these features:
We can see how many of these genes had the same behaviour in the two cohorts (ie, were over or underexpressed in the same group of patients)
## [1] "6 genes out of the 26 common selected genes have the same behaviour in both cohorts."
Graph of the 6 genes that had the same behaviour in the two cohorts:
Now that we have selected features that should help us classify primary and secondary tolerant recipients, we can see that a RF model built on these selected features is better than the one built on all genes:
## predicted
## true primary_tolerant secondary_tolerant
## primary_tolerant 25 4
## secondary_tolerant 8 6
## [1] "prediction error: 27.906976744186"
We can also build a RF model using only the 6 genes that had the same behaviour in both cohorts, but the classification resutls are slightly worse:
## predicted
## true primary_tolerant secondary_tolerant
## primary_tolerant 25 4
## secondary_tolerant 10 4
## [1] "prediction error: 32.5581395348837"
These genes were selected as informative to classify the non and tolerant recipients, with a 0.90 threshold:
## [1] "799 genes were above the 0.90 threshold in the Cryostem cohort."
Which of these selected genes are common with the Cryostem cohort?
## [1] "104 genes were common between the two cohorts."
Graph of these selected genes that were found commonly between the two cohorts:
We can see how many of these genes had the same behaviour in the two cohorts (ie, were over or underexpressed in the same group of patients)
## [1] "16 genes out of the 104 common selected genes have the same behaviour in both cohorts."
Graph of the 16 genes that had the same behaviour in the two cohorts:
Now that we have selected features that should help us classify primary and secondary tolerant recipients, we can see that a RF model built on these selected features is better than the one built on all genes:
## predicted
## true primary_tolerant secondary_tolerant
## primary_tolerant 25 4
## secondary_tolerant 9 5
## [1] "prediction error: 30.2325581395349"
We can also build a RF only on the 16 genes that had the same behaviour in both cohorts, but this doesn’t improve the results:
rna_donors_cryo_beh_PS_90 <- readRDS("../data/RNA_seq_national/donors_only/rna_donor_cryo_beh_PS_90.RDS")
set.seed(1)
rf <- ranger::ranger(group~., data = rna_donors_cryo_beh_PS_90, num.trees = 5000, importance = "impurity")
rf$confusion.matrix
## predicted
## true primary_tolerant secondary_tolerant
## primary_tolerant 25 4
## secondary_tolerant 9 5
paste0("prediction error: ", rf$prediction.error*100)
## [1] "prediction error: 30.2325581395349"
Of the 104 genes that were selected as informative on tolerance in both cohorts, which ones are still high after correcting for age and gender?
A few genes (third “bump” in the density plot) still add information on the tolerance even after correction for gender and age:
names(qt_sel)[which(qt_sel>0.9)]
## [1] "ENSG00000187536" "LINC00964" "AGAP12P"
## [4] "ZMIZ1.AS1" "ENSG00000211801" "ENSG00000211805"
## [7] "ENSG00000211809" "ENSG00000211810" "ENSG00000211863"
## [10] "ENSG00000211884" "BCL2A1" "ENSG00000279722"
## [13] "UTY"
order_demo_corr <- sel_com_names[order(qt_sel, decreasing=T)]
Which of the 104 selected genes were most linked to gender?
Boxplots of these most gender related genes:
Which of the 104 selected genes were most linked to age?
## [1] 17 32 41 77
We can plot the genes that seemed to be most linked to the age of the recipients:
## [1] 19510 124
The names of the genes are in the ensembl ID format. When possible, I replace them with the associated gene symbol:
Perform normalisation and initiate the design matrix (per group)
Save the produced objects:
Load the data :
## Adding missing grouping variables: `COUPLENUMBER`
filter out lowly variable genes:
Prepare the data:
We can apply a principal components analysis to see if there is some structure in the data that can already be seen:
In the following PCA plot, we plot the recipients as black dots, the donors as white dots, and we link donors and recipients that belong to the same couple by a line of the color of their group.
It looks like, in the RNAseq data of the Cryostem cohort, the distances between non tolerant donors and recipients are not really bigger than the distances between primary or secondary tolerant donors and recipients.
We can already try to classify the couples based on all the genes:
## predicted
## true non_tolerant primary_tolerant secondary_tolerant
## non_tolerant 3 15 1
## primary_tolerant 5 22 2
## secondary_tolerant 3 10 1
## [1] 0.5806452
Trying to classify between non tolerant and tolerant couples:
##
## Call:
## randomForest(formula = group ~ ., data = df_tmp, mtry = mtry, ntree = ntree, importance = T, proximity = T)
## Type of random forest: classification
## Number of trees: 5000
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 30.65%
## Confusion matrix:
## non_tolerant tolerant class.error
## non_tolerant 1 18 0.94736842
## tolerant 1 42 0.02325581
Trying to classify only primary and secondary tolerant couples:
##
## Call:
## randomForest(formula = group ~ ., data = df_tol, mtry = mtry, ntree = ntree, importance = T, proximity = T)
## Type of random forest: classification
## Number of trees: 5000
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 37.21%
## Confusion matrix:
## primary_tolerant secondary_tolerant class.error
## primary_tolerant 27 2 0.06896552
## secondary_tolerant 14 0 1.00000000
Run the permutation distribution, taking demographic variables into account: I first add the age and gender compatibility information to the data.
And then compute the permutation distribution for the non vs tolerance:
Compute the median per group for every gene, to use it later in the graphs:
load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNA_seq_national/donors_and_recip/perm_vals_TNT.RData")
load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNA_seq_national/donors_and_recip/norm_data_TNT.RData")
gr_medians <- norm_data %>%
group_by(group) %>%
summarize_if(is.numeric, median) %>%
as.data.frame()
rownames(gr_medians) <- gr_medians$group
These genes were selected as informative to classify the non and tolerant couples, with a 0.95 threshold:
## [1] "476 genes were above the 0.90 threshold in the Cryostem cohort."
Which of these selected genes are common with the Cryostem cohort?
## [1] "68 genes were common between the two cohorts."
Graph of these selected genes that were found commonly between the two cohorts:
We can see how many of these genes had the same behaviour in the two cohorts (ie, were over or underexpressed in the same group of patients)
## [1] "28 genes out of the 68 common selected genes have the same behaviour in both cohorts."
Graph of the 28 genes that had the same behaviour in the two cohorts:
Now that we have selected features that should help us classify primary and secondary tolerant recipients, we can see that a RF model built on these selected features is better than the one built on all genes:
## predicted
## true non_tolerant tolerant
## non_tolerant 7 12
## tolerant 3 40
## [1] "prediction error: 24.1935483870968"
We can also build a RF only on the 28 genes that had the same behaviour in the two cohorts, which doesn’t improve the classification:
## predicted
## true non_tolerant tolerant
## non_tolerant 9 10
## tolerant 7 36
## [1] "prediction error: 27.4193548387097"
These genes were selected as informative to classify the non and tolerant recipients, with a 0.90 threshold:
## [1] "1046 genes were above the 0.90 threshold in the Cryostem cohort."
Which of these selected genes are common with the Cryostem cohort?
## [1] "196 genes were common between the two cohorts."
Graph of these selected genes that were found commonly between the two cohorts:
We can see how many of these genes had the same behaviour in the two cohorts (ie, were over or underexpressed in the same group of patients)
## [1] "87 genes out of the 196 common selected genes have the same behaviour in both cohorts."
Graph of the 87 genes that had the same behaviour in the two cohorts:
Now that we have selected features that should help us classify primary and secondary tolerant recipients, we can see that a RF model built on these selected features is better than the one built on all genes:
## predicted
## true non_tolerant tolerant
## non_tolerant 7 12
## tolerant 4 39
## [1] "prediction error: 25.8064516129032"
We can also build a RF only on the 28 genes that had the same behaviour in the two cohorts, which doesn’t improve the classification:
## predicted
## true non_tolerant tolerant
## non_tolerant 6 13
## tolerant 7 36
## [1] "prediction error: 32.258064516129"
Of the 196 genes that were selected as informative on tolerance in both cohorts, which ones are still high after correcting for age and gender?
A lot of genes (third “bump” in the density plot) still add information on the tolerance even after correction for gender and age:
## [1] "subst_SLC44A5" "subst_ENSG00000224093"
## [3] "subst_PDZK1" "subst_FCER1A"
## [5] "subst_C1orf21" "subst_ENSG00000230470"
## [7] "subst_LINC00298" "subst_LINC00299"
## [9] "subst_NPHP1" "subst_ENSG00000272282"
## [11] "subst_EOMES" "subst_GZMK"
## [13] "subst_PRR16" "subst_SH3RF2"
## [15] "subst_ENSG00000248568" "subst_HRH2"
## [17] "subst_ADAMTS2" "subst_C2"
## [19] "subst_PTCRA" "subst_ENPP5"
## [21] "subst_RAB23" "subst_ENSG00000227678"
## [23] "subst_GPER1" "subst_AOAH.IT1"
## [25] "subst_ENSG00000225314" "subst_ENSG00000261334"
## [27] "subst_ENSG00000233013" "subst_BATF2"
## [29] "subst_PRSS23" "subst_ARHGAP20"
## [31] "subst_KLRD1" "subst_ENSG00000255819"
## [33] "subst_KLRC4" "subst_KLRA1P"
## [35] "subst_IFNG.AS1" "subst_LINC02384"
## [37] "subst_ENSG00000214215" "subst_GJB6"
## [39] "subst_PTGDR" "subst_ADGRG1"
## [41] "subst_LINC00482" "subst_BAHCC1"
## [43] "subst_B4GALT6" "subst_LOC102724545"
## [45] "subst_S1PR5" "subst_KIAA1671"
## [47] "subst_TCN2"
Which of the 196 selected genes were most linked to gender?
Boxplots of these most gender related genes:
Which of the 196 selected genes were most linked to age?
We can plot the genes that seemed to be most linked to the age of the recipients:
Run the permutation distribution, taking demographic variables into account: I first add the age and gender compatibility information to the data.
And then compute the permutation distribution for the non vs tolerance:
Compute the median per group for every gene, to use it later in the graphs:
load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNA_seq_national/donors_and_recip/perm_vals_PS.RData")
load("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/RNA_seq_national/donors_and_recip/norm_data_PS.RData")
gr_medians <- norm_data %>%
group_by(group) %>%
summarize_if(is.numeric, median) %>%
as.data.frame()
rownames(gr_medians) <- gr_medians$group
These genes were selected as informative to classify the non and tolerant recipients, with a 0.95 threshold:
## [1] "1320 genes were above the 0.90 threshold in the Cryostem cohort."
Which of these selected genes are common with the Cryostem cohort?
## [1] "111 genes were common between the two cohorts."
Graph of these selected genes that were found commonly between the two cohorts:
colData_orig <- colData
# rownames(norm_data) are couple numbers -> I change the rownames of colData
# to couples as well so that I can keep using my functions
colData <- colData_orig
colData <- colData %>%
dplyr::filter(grepl("R", rownames(colData))) %>%
mutate(COUPLE = paste0("couple_", COUPLENUMBER))
rownames(colData) <- colData$COUPLE
We can see how many of these genes had the same behaviour in the two cohorts (ie, were over or underexpressed in the same group of patients)
## [1] "6 genes out of the 111 common selected genes have the same behaviour in both cohorts."
Graph of the 6 genes that had the same behaviour in the two cohorts:
Now that we have selected features that should help us classify primary and secondary tolerant couples, we can see that a RF model built on these selected features is better than the one built on all genes:
## predicted
## true primary_tolerant secondary_tolerant
## primary_tolerant 26 3
## secondary_tolerant 8 6
## [1] "prediction error: 25.5813953488372"
We can also build a RF only on the 6 genes that had the same behaviour in both cohorts, but this doesn’t improve the results:
rna_dr_cryo_beh_PS_95 <- readRDS("../data/RNA_seq_national/donors_and_recip/rna_dr_cryo_beh_PS_95.RDS")
set.seed(1)
rf <- ranger::ranger(group~., data = rna_dr_cryo_beh_PS_95, num.trees = 5000, importance = "impurity")
rf$confusion.matrix
## predicted
## true primary_tolerant secondary_tolerant
## primary_tolerant 24 5
## secondary_tolerant 11 3
paste0("prediction error: ", rf$prediction.error*100)
## [1] "prediction error: 37.2093023255814"
These genes were selected as informative to classify the non and tolerant recipients, with a 0.90 threshold:
## [1] "2211 genes were above the 0.90 threshold in the Cryostem cohort."
Which of these selected genes are common with the Cryostem cohort?
## [1] "337 genes were common between the two cohorts."
Graph of these selected genes that were found commonly between the two cohorts:
We can see how many of these genes had the same behaviour in the two cohorts (ie, were over or underexpressed in the same group of patients)
## [1] "15 genes out of the 337 common selected genes have the same behaviour in both cohorts."
Graph of the 15 genes that had the same behaviour in the two cohorts:
Now that we have selected features that should help us classify primary and secondary tolerant couples, we can see that a RF model built on these selected features is better than the one built on all genes:
## predicted
## true primary_tolerant secondary_tolerant
## primary_tolerant 26 3
## secondary_tolerant 9 5
## [1] "prediction error: 27.906976744186"
We can also build a RF only on the 15 genes that had the same behaviour in both cohorts, but this doesn’t improve the results:
rna_dr_cryo_beh_PS_90 <- readRDS("../data/RNA_seq_national/donors_and_recip/rna_dr_cryo_beh_PS_90.RDS")
set.seed(1)
rf <- ranger::ranger(group~., data = rna_dr_cryo_beh_PS_90, num.trees = 5000, importance = "impurity")
rf$confusion.matrix
## predicted
## true primary_tolerant secondary_tolerant
## primary_tolerant 25 4
## secondary_tolerant 9 5
paste0("prediction error: ", rf$prediction.error*100)
## [1] "prediction error: 30.2325581395349"
Of the 337 genes that were selected as informative on tolerance in both cohorts, which ones are still high after correcting for age and gender?
A lot of genes (third “bump” in the density plot) still add information on the tolerance even after correction for gender and age:
names(qt_sel)[which(qt_sel>0.90)]
## [1] "subst_MIR30E" "subst_KIAA1614.AS1"
## [3] "subst_LOC102724919" "subst_BATF3"
## [5] "subst_ENSG00000187536" "subst_ENSG00000223922"
## [7] "subst_ENSG00000233757" "subst_CCDC173"
## [9] "subst_ENSG00000281100" "subst_PROK2"
## [11] "subst_LIPH" "subst_LINC01262"
## [13] "subst_ENSG00000251215" "subst_ENSG00000271795"
## [15] "subst_ENSG00000272053" "subst_SCIN"
## [17] "subst_ARHGEF34P" "subst_GIMAP7"
## [19] "subst_GIMAP4" "subst_ENSG00000270990"
## [21] "subst_GIMAP1" "subst_ENSG00000250569"
## [23] "subst_ENSG00000227730" "subst_ENSG00000227388"
## [25] "subst_ENSG00000235641" "subst_ENSG00000254463"
## [27] "subst_PLEKHB1" "subst_ENSG00000271022"
## [29] "subst_HTRA1" "subst_GRASP"
## [31] "subst_TPTE2P5" "subst_ENSG00000211778"
## [33] "subst_ENSG00000211795" "subst_ENSG00000211816"
## [35] "subst_ENSG00000211884" "subst_ENSG00000211885"
## [37] "subst_ENSG00000247287" "subst_ENSG00000260810"
## [39] "subst_LINC02325" "subst_SNORD116.18"
## [41] "subst_C15orf53" "subst_BCL2A1"
## [43] "subst_ENSG00000251819" "subst_ENSG00000262621"
## [45] "subst_ENSG00000279620" "subst_LINC02175"
## [47] "subst_TOP2A" "subst_EME1"
## [49] "subst_ENSG00000259349" "subst_LINC01730"
## [51] "subst_SSTR3" "subst_ENSG00000279720"
## [53] "subst_ENSG00000239179" "subst_LINC00649"
order_demo_corr <- sel_com_names[order(qt_sel, decreasing=T)]
Which of the 337 selected genes were most linked to gender?
Boxplots of these most gender related genes:
Which of the 337 selected genes were most linked to age?
We can plot the genes that seemed to be most linked to the age of the recipients:
Now that we have selected features at the couple level, at the recipient level and at the donor level, we can try to combine these features:
We first combine the features that were selected as highly informative to classify between tolerant and non tolerant patients from the couples, recipients and donors into one big table:
We can generate a PCA and a tSNE using all these features to see how the tolerant and non tolerant patients are represented in the reduced dimensions:
The 2 first components of the PCA seem to hold quite a lot of the variability in the data:
plot(pca_all)
The tolerant patients seem to have higher PC1 values on the PCA (they are situated more on the right), even though the two groups are quite mixed, which suggests that it might still be difficult to classify the patients into tolerant and non tolerant groups based on the features we selected.
We can also generate a tSNE on the same data:
We can see if building a model using all these features combined leads to good classification results between tolerant and non tolerant patients:
## predicted
## true non_tolerant tolerant
## non_tolerant 5 14
## tolerant 5 38
## [1] "prediction error: 30.6451612903226"
The classification error is quite high, but we can still see which features were selected in the model to classify tolerant from non tolerant patients in the Cryostem cohort:
## subst_LOC100289473 D_ENSG00000249921 R_MGP
## 0.3832490 0.3746105 0.3690787
## subst_BATF2 subst_CCR3 subst_SLC41A2
## 0.3619522 0.2892047 0.2621836
## subst_RHOBTB3 D_CCR9 R_SLC41A2
## 0.2613744 0.2489799 0.2430470
## subst_ENSG00000260528 subst_KLRD1 subst_ENSG00000269895
## 0.2397840 0.2319027 0.2312693
## D_LOC283177 R_ENSG00000276791 subst_CEP78
## 0.2279287 0.2218172 0.2050951
## subst_PTGDR subst_C2 subst_LINC00482
## 0.1983707 0.1965828 0.1935969
## D_GIMAP1 D_ENSG00000268520
## 0.1897815 0.1879739
The variables that were mainly selected as informative in the model are the ones from recipients and couples.
We can visualise the top variables in a heatmap: In the first heatmap, the patients (in rows) are ordered by group and the 20 top genes (in columns) are ordered by importance in the random forest model. In the second heatmap, the patients and 20 top genes are clustered hierarchically: the patients that are most similar are closer to each other, same for the genes.
Reminder: high values for the variables that start with “subst_” mean that the difference of expression between the donor and the recipient of the same couple was high for this specific feature.
We first combine the features that were selected as highly informative to classify between tolerant and non tolerant patients from the couples, recipients and donors into one big table:
We can generate a PCA and a tSNE using all these features to see how the tolerant and non tolerant patients are represented in the reduced dimensions:
The 2 first components of the PCA seem to hold quite a lot of the variability in the data:
plot(pca_all)
The tolerant patients seem to have higher PC1 values on the PCA (they are situated more on the right), even though the two groups are quite mixed, which suggests that it might still be difficult to classify the patients into tolerant and non tolerant groups based on the features we selected.
We can also generate a tSNE on the same data:
We can see if building a model using all these features combined leads to good classification results between tolerant and non tolerant patients:
## predicted
## true non_tolerant tolerant
## non_tolerant 3 16
## tolerant 5 38
## [1] "prediction error: 33.8709677419355"
The classification error is quite high, but we can still see which features were selected in the model to classify tolerant from non tolerant patients in the Cryostem cohort:
## D_NUAK1 R_UBE2Q2P2 R_ARHGAP20
## 0.3133910 0.2135810 0.2064745
## subst_LOC100289473 subst_ENSG00000260528 subst_RHOBTB3
## 0.2062774 0.1903459 0.1901955
## D_ENSG00000255186 D_ENSG00000175841 D_LOC283177
## 0.1892102 0.1845207 0.1837910
## D_ENSG00000249921 R_ALG1L2 R_MGP
## 0.1669278 0.1647497 0.1480861
## R_SH3GL1P1 subst_CCR3 subst_BATF2
## 0.1457765 0.1432728 0.1375155
## D_ENSG00000271936 R_B3GALT2 subst_LINC00106
## 0.1345911 0.1307544 0.1305171
## subst_BNC2 D_LOC644090
## 0.1284488 0.1256283
The variables that were mainly selected as informative in the model are the ones from recipients and couples.
We can visualise the top variables in a heatmap: In the first heatmap, the patients (in rows) are ordered by group and the 20 top genes (in columns) are ordered by importance in the random forest model. In the second heatmap, the patients and 20 top genes are clustered hierarchically: the patients that are most similar are closer to each other, same for the genes.
Reminder: high values for the variables that start with “subst_” mean that the difference of expression between the donor and the recipient of the same couple was high for this specific feature.
We first combine the features that were selected as highly informative to classify between tolerant and non tolerant patients from the couples, recipients and donors into one big table:
We can generate a PCA and a tSNE using all these features to see how the tolerant and non tolerant patients are represented in the reduced dimensions:
The 2 first components of the PCA seem to hold quite a lot of the variability in the data:
plot(pca_all)
The tolerant patients seem to have higher PC1 values on the PCA (they are situated more on the right), even though the two groups are quite mixed, which suggests that it might still be difficult to classify the patients into tolerant and non tolerant groups based on the features we selected.
We can also generate a tSNE on the same data:
We can see if building a model using all these features combined leads to good classification results between tolerant and non tolerant patients:
## predicted
## true non_tolerant tolerant
## non_tolerant 5 14
## tolerant 7 36
## [1] "prediction error: 33.8709677419355"
The classification error is quite high, but we can still see which features were selected in the model to classify tolerant from non tolerant patients in the Cryostem cohort:
## D_NUAK1 R_UBE2Q2P2 R_ARHGAP20
## 0.4578604 0.3333479 0.3177524
## subst_LINC00106 subst_BATF2 R_ENSG00000144642
## 0.3116562 0.2810177 0.2742881
## R_MGP subst_LOC100289473 R_B3GALT2
## 0.2577822 0.2435938 0.2293077
## D_CCR9 subst_C2 subst_ENSG00000260528
## 0.2292191 0.2253230 0.2240433
## R_SNORA13 subst_SLC41A2 R_FHAD1
## 0.2188952 0.2186340 0.2102651
## subst_ENSG00000233013 subst_TNR subst_ARHGAP20
## 0.2047453 0.1890582 0.1880447
## D_IL5RA R_ENSG00000234902
## 0.1778272 0.1700615
The variables that were mainly selected as informative in the model are the ones from recipients and couples.
We can visualise the top variables in a heatmap: In the first heatmap, the patients (in rows) are ordered by group and the 20 top genes (in columns) are ordered by importance in the random forest model. In the second heatmap, the patients and 20 top genes are clustered hierarchically: the patients that are most similar are closer to each other, same for the genes.
Reminder: high values for the variables that start with “subst_” mean that the difference of expression between the donor and the recipient of the same couple was high for this specific feature.
We first combine the features that were selected as highly informative to classify between primary and secondary tolerant patients from the couples, recipients and donors into one big table:
We can generate a PCA and a tSNE using all these features to see how the tolerant and non tolerant patients are represented in the reduced dimensions:
The 2 first components of the PCA seem to hold quite a lot of the variability in the data:
plot(pca_all)
The primary tolerant patients seem to have lower PC1 values on the PCA (they are situated more on the left), but the two groups are still quite mixed. We can still see which genes are driving the first PC:
pca_all$rotation[which(abs(pca_all$rotation[,"PC1"])>0.15),c(1)]
## subst_PPIAL4C subst_ENSG00000242134 subst_ENSG00000267257
## 0.1676553 0.1559545 -0.1912630
## subst_LINC01727 subst_ID1
## 0.1666386 -0.1737556
It seems like the differences between donors and recipients in these 5 genes mainly drive the 1st PC.
We also see a separation of the patients in a top and a low group. We can see which genes are mainly driving the second axis of the PCA:
pca_all$rotation[which(abs(pca_all$rotation[,"PC2"])>0.1),c(1,2)]
## PC1 PC2
## subst_USP9Y 0.04015393 -0.1828991
## D_UTY 0.13741244 -0.9483429
It seems like the expression of the gene UTY in the donors is mainly driving the separation into two groups in the PCA:
We can also generate a tSNE on the same data:
We can see if building a model using all these features combined leads to good classification results between tolerant and non tolerant patients:
## predicted
## true primary_tolerant secondary_tolerant
## primary_tolerant 26 3
## secondary_tolerant 8 6
## [1] "prediction error: 25.5813953488372"
We can see which features were selected in the model to classify tolerant from non tolerant patients in the St Louis cohort:
## subst_MIR4458HG subst_ENSG00000219409 subst_ENSG00000242798
## 0.5275813 0.5118513 0.4308104
## subst_ZNF890P subst_ENSG00000248785 subst_ENSG00000235672
## 0.4165941 0.3586246 0.3122848
## subst_SCIN subst_PLD6 subst_ENSG00000199977
## 0.2937397 0.2918012 0.2904153
## D_UTY D_LINC01225 subst_ENSG00000267257
## 0.2669379 0.2617474 0.2610529
## subst_LOC105369536 subst_AREG subst_MFSD2B
## 0.2585041 0.2439663 0.2427976
## subst_FMN1 subst_ENSG00000211795 subst_LINC00243
## 0.2315007 0.2311396 0.2305511
## subst_ENSG00000242607 subst_RPA4
## 0.2276026 0.2210946
The variables that were mainly selected as informative in the model are the ones from recipients and couples.
We can visualise the top variables in a heatmap:
Reminder: high values for the variables that start with “subst_” mean that the difference of expression between the donor and the recipient of the same couple was high for this specific feature.
We first combine the features that were selected as highly informative to classify between primary and secondary tolerant patients from the couples, recipients and donors into one big table:
We can generate a PCA and a tSNE using all these features to see how the tolerant and non tolerant patients are represented in the reduced dimensions:
The 2 first components of the PCA seem to hold quite a lot of the variability in the data:
plot(pca_all)
The primary tolerant patients seem to have higher PC1 values on the PCA (they are situated more on the right), which suggests that we might be able to classify primary from secondary tolerant patients based on the features we selected.
We can also generate a tSNE on the same data:
We can see if building a model using all these features combined leads to good classification results between tolerant and non tolerant patients:
## predicted
## true primary_tolerant secondary_tolerant
## primary_tolerant 26 3
## secondary_tolerant 8 6
## [1] "prediction error: 25.5813953488372"
We can see which features were selected in the model to classify tolerant from non tolerant patients in the St Louis cohort:
## subst_MIR4458HG subst_ENSG00000219409 subst_THBS4
## 0.2467427 0.2385918 0.2058012
## D_MEG3 subst_ENSG00000256673 R_ENSG00000257246
## 0.2041543 0.1929848 0.1919989
## subst_ENSG00000242798 subst_ENSG00000279347 subst_ZNF890P
## 0.1919602 0.1917726 0.1847501
## subst_ENSG00000220867 subst_ZNF311 subst_LINC00964
## 0.1792660 0.1685747 0.1598209
## subst_ENSG00000238259 subst_ENSG00000248785 subst_ENSG00000267257
## 0.1479076 0.1401106 0.1388079
## D_UTY subst_LOC105369536 D_ENSG00000272369
## 0.1318366 0.1304484 0.1263769
## subst_ENSG00000199977 subst_ENSG00000235672
## 0.1222037 0.1181425
The variables that were mainly selected as informative in the model are the ones from recipients and couples.
We can visualise the top variables in a heatmap:
Reminder: high values for the variables that start with “subst_” mean that the difference of expression between the donor and the recipient of the same couple was high for this specific feature.
We first combine the features that were selected as highly informative to classify between primary and secondary tolerant patients from the couples, recipients and donors into one big table:
We can generate a PCA and a tSNE using all these features to see how the tolerant and non tolerant patients are represented in the reduced dimensions:
The 2 first components of the PCA seem to hold quite a lot of the variability in the data:
plot(pca_all)
The primary tolerant patients seem to have higher PC1 values on the PCA (they are situated more on the right), which suggests that we might be able to classify primary from secondary tolerant patients based on the features we selected.
We can also generate a tSNE on the same data:
We can see if building a model using all these features combined leads to good classification results between tolerant and non tolerant patients:
## predicted
## true primary_tolerant secondary_tolerant
## primary_tolerant 29 0
## secondary_tolerant 9 5
## [1] "prediction error: 20.9302325581395"
We can see which features were selected in the model to classify tolerant from non tolerant patients in the St Louis cohort:
## D_MEG3 subst_ENSG00000279347 R_ASAP3
## 1.0008614 0.9424567 0.6009436
## subst_RPS6KA6 R_KLRC2 R_ENSG00000233597
## 0.5784761 0.5780665 0.5576253
## D_LINC01225 D_AGAP12P subst_FMN1
## 0.5244211 0.4968772 0.4845347
## R_PCLAF R_FAM83D subst_PCLAF
## 0.4700820 0.4590246 0.4009827
## R_PRUNE2 R_AGMAT R_UBE2C
## 0.3888723 0.3778133 0.3681311
## subst_CDC25A D_TMCC2 R_HACD1
## 0.3506655 0.3415350 0.3370150
## subst_ENSG00000228232 R_ENSG00000210107
## 0.3255356 0.3177950
The variables that were mainly selected as informative in the model are the ones from recipients and couples.
We can visualise the top variables in a heatmap:
Reminder: high values for the variables that start with “subst_” mean that the difference of expression between the donor and the recipient of the same couple was high for this specific feature.
Summary of the number of significative genes found overall, between non tolerant and tolerant patients:
r_tnt_95_louis <- read.xlsx("../data/RNAseq/recip_only/perm_val_TNT_95_thresh.xlsx", colNames = FALSE)
r_tnt_95_cryo <- read.xlsx("../data/RNA_seq_national/recip_only/perm_val_TNT_95_thresh.xlsx", colNames = FALSE)
load("../data/RNA_seq_national/recip_only/rna_recip_cryo_TNT_95.RData")
r_tnt_95_beh <- read.xlsx("../data/RNAseq/recip_only/perm_val_beh_TNT_95_thresh.xlsx")
summary_genes <- as.data.frame(t(c("Recip_TNT", 0.95, length(r_tnt_95_louis$X1),
length(r_tnt_95_cryo$X1), (ncol(rna_recip_cryo_TNT_95)-1),
length(r_tnt_95_beh$sel_beh))))
r_tnt_90_louis <- read.xlsx("../data/RNAseq/recip_only/perm_val_TNT_90_thresh.xlsx", colNames = FALSE)
r_tnt_90_cryo <- read.xlsx("../data/RNA_seq_national/recip_only/perm_val_TNT_90_thresh.xlsx", colNames = FALSE)
load("../data/RNA_seq_national/recip_only/rna_recip_cryo_TNT_90.RData")
r_tnt_90_beh <- read.xlsx("../data/RNAseq/recip_only/perm_val_beh_TNT_90_thresh.xlsx")
summary_genes <- rbind(summary_genes,
as.data.frame(t(c("Recip_TNT", 0.90, length(r_tnt_90_louis$X1),
length(r_tnt_90_cryo$X1), (ncol(rna_recip_cryo_TNT_90)-1),
length(r_tnt_90_beh$sel_beh)))))
d_tnt_95_louis <- read.xlsx("../data/RNAseq/donors_only/perm_val_TNT_95_thresh.xlsx", colNames = FALSE)
d_tnt_95_cryo <- read.xlsx("../data/RNA_seq_national/donors_only/perm_val_TNT_95_thresh.xlsx", colNames = FALSE)
load("../data/RNA_seq_national/donors_only/rna_donors_cryo_TNT_95.RData")
d_tnt_95_beh <- read.xlsx("../data/RNAseq/donors_only/perm_val_beh_TNT_95_thresh.xlsx")
summary_genes <- rbind(summary_genes,
as.data.frame(t(c("Donors_TNT", 0.95, length(d_tnt_95_louis$X1),
length(d_tnt_95_cryo$X1), ncol(rna_donors_cryo_TNT_95)-1,
length(d_tnt_95_beh$sel_beh)))))
d_tnt_90_louis <- read.xlsx("../data/RNAseq/donors_only/perm_val_TNT_90_thresh.xlsx", colNames = FALSE)
d_tnt_90_cryo <- read.xlsx("../data/RNA_seq_national/donors_only/perm_val_TNT_90_thresh.xlsx", colNames = FALSE)
load("../data/RNA_seq_national/donors_only/rna_donors_cryo_TNT_90.RData")
d_tnt_90_beh <- read.xlsx("../data/RNAseq/donors_only/perm_val_beh_TNT_90_thresh.xlsx")
summary_genes <- rbind(summary_genes,
as.data.frame(t(c("Donors_TNT", 0.90, length(d_tnt_90_louis$X1),
length(d_tnt_90_cryo$X1), ncol(rna_donors_cryo_TNT_90)-1,
length(d_tnt_90_beh$sel_beh)))))
dr_tnt_95_louis <- read.xlsx("../data/RNAseq/donors_and_recip/perm_val_TNT_95_thresh.xlsx", colNames = FALSE)
dr_tnt_95_cryo <- read.xlsx("../data/RNA_seq_national/donors_and_recip/perm_val_TNT_95_thresh.xlsx", colNames = FALSE)
load("../data/RNA_seq_national/donors_and_recip/rna_dr_cryo_TNT_95.RData")
dr_tnt_95_beh <- read.xlsx("../data/RNAseq/donors_and_recip/perm_val_beh_TNT_95_thresh.xlsx")
summary_genes <- rbind(summary_genes,
as.data.frame(t(c("D-R_TNT", 0.95, length(dr_tnt_95_louis$X1),
length(dr_tnt_95_cryo$X1), ncol(rna_dr_cryo_TNT_95)-1,
length(dr_tnt_95_beh$sel_beh)))))
dr_tnt_90_louis <- read.xlsx("../data/RNAseq/donors_and_recip/perm_val_TNT_90_thresh.xlsx", colNames = FALSE)
dr_tnt_90_cryo <- read.xlsx("../data/RNA_seq_national/donors_and_recip/perm_val_TNT_90_thresh.xlsx", colNames = FALSE)
load("../data/RNA_seq_national/donors_and_recip/rna_dr_cryo_TNT_90.RData")
dr_tnt_90_beh <- read.xlsx("../data/RNAseq/donors_and_recip/perm_val_beh_TNT_90_thresh.xlsx")
summary_genes <- rbind(summary_genes,
as.data.frame(t(c("D-R_TNT", 0.90, length(dr_tnt_90_louis$X1),
length(dr_tnt_90_cryo$X1), ncol(rna_dr_cryo_TNT_90)-1,
length(dr_tnt_90_beh$sel_beh)))))
colnames(summary_genes) <- c("type", "threshold", "St Louis", "Cryostem", "both", "behaviour")
summary_genes %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| type | threshold | St Louis | Cryostem | both | behaviour |
|---|---|---|---|---|---|
| Recip_TNT | 0.95 | 1876 | 1264 | 140 | 111 |
| Recip_TNT | 0.9 | 2647 | 2492 | 408 | 278 |
| Donors_TNT | 0.95 | 1312 | 675 | 70 | 5 |
| Donors_TNT | 0.9 | 2592 | 1258 | 310 | 22 |
| D-R_TNT | 0.95 | 1406 | 476 | 68 | 28 |
| D-R_TNT | 0.9 | 2216 | 1046 | 196 | 87 |
Summary of the number of significative genes found overall, between primary and secondary tolerant patients:
r_ps_95_louis <- read.xlsx("../data/RNAseq/recip_only/perm_val_PS_95_thresh.xlsx", colNames = FALSE)
r_ps_95_cryo <- read.xlsx("../data/RNA_seq_national/recip_only/perm_val_PS_95_thresh.xlsx", colNames = FALSE)
load("../data/RNA_seq_national/recip_only/rna_recip_cryo_PS_95.RData")
r_ps_95_beh <- read.xlsx("../data/RNAseq/recip_only/perm_val_beh_PS_95_thresh.xlsx")
summary_genes <- as.data.frame(t(c("Recip_PS", 0.95, length(r_ps_95_louis$X1),
length(r_ps_95_cryo$X1), (ncol(rna_recip_cryo_PS_95)-1),
length(r_ps_95_beh$sel_beh))))
r_ps_90_louis <- read.xlsx("../data/RNAseq/recip_only/perm_val_PS_90_thresh.xlsx", colNames = FALSE)
r_ps_90_cryo <- read.xlsx("../data/RNA_seq_national/recip_only/perm_val_PS_90_thresh.xlsx", colNames = FALSE)
load("../data/RNA_seq_national/recip_only/rna_recip_cryo_PS_90.RData")
r_ps_90_beh <- read.xlsx("../data/RNAseq/recip_only/perm_val_beh_PS_90_thresh.xlsx")
summary_genes <- rbind(summary_genes,
as.data.frame(t(c("Recip_PS", 0.90, length(r_ps_90_louis$X1),
length(r_ps_90_cryo$X1), (ncol(rna_recip_cryo_PS_90)-1),
length(r_ps_90_beh$sel_beh)))))
d_ps_95_louis <- read.xlsx("../data/RNAseq/donors_only/perm_val_PS_95_thresh.xlsx", colNames = FALSE)
d_ps_95_cryo <- read.xlsx("../data/RNA_seq_national/donors_only/perm_val_PS_95_thresh.xlsx", colNames = FALSE)
load("../data/RNA_seq_national/donors_only/rna_donors_cryo_PS_95.RData")
d_ps_95_beh <- read.xlsx("../data/RNAseq/donors_only/perm_val_beh_PS_95_thresh.xlsx")
summary_genes <- rbind(summary_genes,
as.data.frame(t(c("Donors_PS", 0.95, length(d_ps_95_louis$X1),
length(d_ps_95_cryo$X1), ncol(rna_donors_cryo_PS_95)-1,
length(d_ps_95_beh$sel_beh)))))
d_ps_90_louis <- read.xlsx("../data/RNAseq/donors_only/perm_val_PS_90_thresh.xlsx", colNames = FALSE)
d_ps_90_cryo <- read.xlsx("../data/RNA_seq_national/donors_only/perm_val_PS_90_thresh.xlsx", colNames = FALSE)
load("../data/RNA_seq_national/donors_only/rna_donors_cryo_PS_90.RData")
d_ps_90_beh <- read.xlsx("../data/RNAseq/donors_only/perm_val_beh_PS_90_thresh.xlsx")
summary_genes <- rbind(summary_genes,
as.data.frame(t(c("Donors_PS", 0.90, length(d_ps_90_louis$X1),
length(d_ps_90_cryo$X1), ncol(rna_donors_cryo_PS_90)-1,
length(d_ps_90_beh$sel_beh)))))
dr_ps_95_louis <- read.xlsx("../data/RNAseq/donors_and_recip/perm_val_PS_95_thresh.xlsx", colNames = FALSE)
dr_ps_95_cryo <- read.xlsx("../data/RNA_seq_national/donors_and_recip/perm_val_PS_95_thresh.xlsx", colNames = FALSE)
load("../data/RNA_seq_national/donors_and_recip/rna_dr_cryo_PS_95.RData")
dr_ps_95_beh <- read.xlsx("../data/RNAseq/donors_and_recip/perm_val_beh_PS_95_thresh.xlsx")
summary_genes <- rbind(summary_genes,
as.data.frame(t(c("D-R_PS", 0.95, length(dr_ps_95_louis$X1),
length(dr_ps_95_cryo$X1), ncol(rna_dr_cryo_PS_95)-1,
length(dr_ps_95_beh$sel_beh)))))
dr_ps_90_louis <- read.xlsx("../data/RNAseq/donors_and_recip/perm_val_PS_90_thresh.xlsx", colNames = FALSE)
dr_ps_90_cryo <- read.xlsx("../data/RNA_seq_national/donors_and_recip/perm_val_PS_90_thresh.xlsx", colNames = FALSE)
load("../data/RNA_seq_national/donors_and_recip/rna_dr_cryo_PS_90.RData")
dr_ps_90_beh <- read.xlsx("../data/RNAseq/donors_and_recip/perm_val_beh_PS_90_thresh.xlsx")
summary_genes <- rbind(summary_genes,
as.data.frame(t(c("D-R_PS", 0.90, length(dr_ps_90_louis$X1),
length(dr_ps_90_cryo$X1), ncol(rna_dr_cryo_PS_90)-1,
length(dr_ps_90_beh$sel_beh)))))
colnames(summary_genes) <- c("type", "threshold", "St Louis", "Cryostem", "both", "behaviour")
summary_genes %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| type | threshold | St Louis | Cryostem | both | behaviour |
|---|---|---|---|---|---|
| Recip_PS | 0.95 | 343 | 367 | 3 | 1 |
| Recip_PS | 0.9 | 723 | 907 | 45 | 24 |
| Donors_PS | 0.95 | 810 | 441 | 26 | 6 |
| Donors_PS | 0.9 | 1498 | 799 | 104 | 16 |
| D-R_PS | 0.95 | 627 | 1320 | 111 | 6 |
| D-R_PS | 0.9 | 1192 | 2211 | 337 | 15 |